Rico Angell

Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

May 27, 2026
Via arXiv

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

May 20, 2026
Via arXiv

Estimating Tail Risks in Language Model Output Distributions

Apr 24, 2026
Via arXiv

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Oct 01, 2025
Via arXiv

Jailbreak Strength and Model Similarity Predict Transferability

Jun 15, 2025
Via arXiv

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Jun 12, 2025
Via arXiv

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Jul 31, 2024
Via arXiv

Polynomial Precision Dependence Solutions to Alignment Research Center Matrix Completion Problems

Jan 08, 2024
Via arXiv

Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching

Dec 19, 2023
Via arXiv

Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization

Oct 23, 2022
Via arXiv